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1  a  ABSTRACT(M«*»«n 

This  paper  describes  the  training  of  discrete-time  dynamic  systems  with  adaptive 
parameters  (recurrent  neural  networks)  using  first-order  gradient  descent  algorithms. 

To  facilitate  the  explanation  of  these  algorithms,  a  standard  representation  of  a 
discrete— time  dynamic  system  is  defined.  Any  differentiable  discrete  dynamic  system 
may  be  put  in  this  standard  representation  and  trained  using  a  gradient  descent 
algorithm.  Using  the  standard  representation,  we  described  two  general  types  of 
learning  algorithms.  The  first  is  based  upon  the  discrete-time  Euler-Lagrange 
equations,  and  the  second  is  based  upon  a  recursive  update  of  the  output  gradients. 

Both  the  epochwlse  and  on-line  versions  of  these  algorithms  are  presented.  When  the 
dynamic  system  is  implemented  by  a  neural  network,  the  epochwise  algorithm  based  on  the 
Euler-Lagrange  equations  is  equivalent  to  backpropagation-through-tlme  and  the  on-line 
method  based  on  the  recursive  equation  is  the  same  as  recursive  backpropagation.  It  is 
shown  that  the  epochwise  versions  of  the  algorithms  are  equivalent.  The  two  on-line 
versions  of  the  algorithms  are  shown  to  be  approximately  equivalent.  Because  of  the 
equivalence  of  the  algorithms,  selection  of  an  appropriate  gradient  descent  algorithm 
is  based  solely  upon  computational  efficiency  and  storage  requirements.  Accordingly, 
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a  discussion  of  these  two  properties  of  the  algorithms  is  included. 
To  Illustrate  the  differences  between  the  algorithms  and  the  useful¬ 
ness  of  the  standard  representation,  two  examples  are  Included. 


1  Introduction 


The  ability  of  humans  to  control  and  interact  with  a  complex  environment  has  motivated  our  study  of 
adaptive  discrete-time  dynamic  systems  which  are  fully  or  at  least  partially  composed  of  neural  networks. 
Humans  regularly  perform  certain  tasks  easily  and  proficiently  which  have  proven  difficult  to  reproduce  with 
machines.  Examples  of  such  tasks  include  driving  a  car,  recognizing  the  differences  between  a  cat  and  a  dog, 
and  understanding  spoken  language.  Humans  learn  how  to  accomplish  these  tasks  in  part  by  interacting  with, 
manipulating  and  eventually  controlling  their  environment.  In  order  to  build  machines  which  accomplish 
these  difficult  tasks,  it  may  be  necessary  both  to  mimic  humans  learning  through  environmental  interaction 
and  to  model  the  low  level  functions  of  the  human  nervous  system.  The  algorithms  presented  in  this  paper 
provide  one  possible  technique  for  accommodating  both  of  these  requirements.  These  algorithms  train 
neural  networks,  which  model  the  low  level  functions  of  the  human  nervous  system,  by  interacting  with  a 
user-specified  environment. 

An  adaptive  dynamic  system  which  is  composed  of  a  neural  network  and  a  set  of  equations  which  describe 
the  environment  may  be  used  as  an  adaptive  model  of  the  interaction  of  a  human  with  its  environment.  In  this 
model,  the  neural  network  receives  the  state  of  the  environmental  as  input.  Using  the  state  in  combination 
with  a  desired  goal,  the  neural  network  outputs  a  control  signal  which  manipulates  the  environment.  Using 
the  training  algorithms  presented  in  this  paper,  system  performance  often  becomes  strongly  human-like.  The 
first  example  of  Section  9  gives  a  prime  example  of  this  phenomenon. 

Although  the  human  ability  to  control  its  environment  motivates  us,  our  primary  interest  lies  in  the 
development  of  engineering  tools  rather  than  in  the  modeling  of  humans.  The  training  of  discrete-time 
dynamic  systems  has  applications  in  the  engineering  fields  of  pattern  recognition,  nonlinear  control,  adaptive 
control  and  adaptive  digital  filtering.  It  is  in  these  areas  that  the  material  in  this  paper  will  be  of  most 
immediate  use. 

Currently,  the  theory  on  first-order  gradient  descent  training  of  adaptive  discrete-time  dynamic  networks 
is  described  in  separate  and  unrelated  terms  in  several  papers  by  different  authors  [1,2, 3, 4,5, 6].  Our  goal 
is  both  to  bring  together  in  a  coherent  manner  and  to  expand  the  theory  on  this  subject.  A  coherent 
presentation  of  the  subject  is  achieved  by  deriving  the  learning  algorithms  using  a  standard  representation  of 
a  dynamic  system.  The  derivation  of  the  generalized  forms  of  the  existing  algorithms  provides  for  extensions 
of  current  algorithms.  Bringing  together  and  generalizing  the  theory  in  this  manner  should  facilitate  the 
selection  of  appropriate  gradient  descent  algorithms  for  problems  requiring  discrete-time  recurrent  networks. 
It  should  be  noted  that  a  forthcoming  paper  by  Williams  euid  Zipser  [7]  contains  a  detailed  discussion  of 
adapting  dynamic  neural  networks,  whereas,  this  paper  presents  the  concepts  of  first-order  gradient  descent 
for  systems  composed  both  fully  or  partially  of  neural  networks. 

The  paper  is  composed  of  ten  sections.  Section  2  introduces  notation  and  the  standard  representation  for 
a  discrete-time  dynamic  system.  Section  3  presents  the  differences  between  on-line  2ind  epochwise  training. 
An  introduction  to  gradient  descent  training  of  static  systems  is  given  in  Section  4.  Section  5  presents  the 
algorithms  used  for  adapting  discrete-time  dynamic  systems.  The  equivalence  of  the  algorithms  is  presented 
in  Section  6.  A  comparison  of  the  computational  emd  storage  requirement  of  the  edgorithms  is  included  in 
Section  7.  Two  techniques  of  speeding-up  the  on-line  training  algorithms  are  discussed  in  Section  8.  Section  9 
presents  two  applications  which  illustrate  the  usefulness  of  the  theory  discussed  in  the  paper,  and  Section  10 
provides  a  conclusion. 


2  System  Definition 

In  this  section,  a  standard  representation  of  any  discrete-time  dynamic  system  is  proposed.  This  representa¬ 
tion  is  used  in  the  derivation  of  the  learning  algorithms.  In  addition,  the  ordered  derivative,  which  simplifies 
the  calculation  of  derivatives  of  complex  systems,  is  introduced. 
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2.1  The  Standard  Representation 

Let  k  denote  the  iteration  of  a  discrete-time  dynamic  system,  with  1  =  0  representing  the  first  iteration  of 
the  system.  A  standard  representation  of  discrete-time  dynamic  system  is  illustrated  in  Figure  1.  The  input 
of  the  system  at  iteration  k  is  defined  by  two  components.  The  first  component,  Rt,  is  composed  of  an 
external  input  vector,  r*,  and  the  previous  J  delayed  versions  of  this  vector,  rt_i, . . .  ,rk-j-  The  external 
input  vector  is  a  length  M  column  vector,  rt  €  Therefore,  the  external  inputs  to  the  system  including 

delayed  inputs  is  a  length  (J  +  l)Af  column  vector  of  the  form  R*  =  [rj’,rj’_j, . . .  e 

The  second  component,  Y*,  is  made  up  of  the  previous  L  output  vectors.  The  output  vector  at  iteration 
1  is  a  length  N  column  vector,  yt  6  Therefore,  Yk  is  a  length  LN  column  vector  of  the  form 

Y*  =  [yt_i>yt_2>  ■  •  • 

At  iteration  k,  let  the  adaptive  parameter  vector,  which  we  shEdl  refer  to  as  the  weight  vector,  be  selected 
from  a  set  of  weight  vectors.  In  general,  this  set  of  weight  vectors  is  generated  by  the  training  algorithm  as 
discussed  in  Section  3.  Assuming  a  weight  vector  to  be  a  length  Q  column  vector,  the  weight  set  W  has 

the  form  W  =  {W(0),  W(l) . W(i),.. .}.  The  use  of  the  j*''  weight  vector  at  the  1*'*  iteration  shall  be 

denoted  Wt(j).  Finally,  let  Wk{i)  denote  any  weight  of  the  vector  Wt(i).  Of  course,  Wk{i)  is  a  scalar. 

By  denoting  any  element  of  a  discrete-time  dynamic  system  which  is  connected  to  a  delay  as  an  output 
and  including  it  in  yk,  any  discrete-time  system  can  be  written  as 


y*  =  /(Rt,Yt,W*(i)) 


(1) 


where  the  function  /  contains  no  delay.  Thus  the  schematic  representation  in  Figure  1  can  be  used  to 
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Figure  2;  Neural  Controller  and  Plant  in  the  Standard  Representation. 

describe  any  discrete-time  dynamic  system.  The  first-order  gradient  descent  algorithms  to  be  described  in 
this  paper  are  defined  in  terms  of  the  standard  representation  shown  in  Figure  1.  Because  any  adaptive 
discrete-time  dynamic  system  can  be  put  in  the  standard  representation,  such  systems  can  be  trained  using 
the  algorithms  described  in  this  paper  provided  that  the  output  vector,  yt,  is  differentiable  with  respect  to 
the  weight  vector,  Wjb(t),  and  the  recurrent  input,  Y^.  The  need  for  this  requirement  will  become  evident 
in  Section  5. 


2.2  An  Example  of  a  System  in  the  Standard  Representation 

In  order  to  illustrate  the  use  of  the  standard  representation,  an  adaptive  discrete-time  dynamic  system  which 
consists  of  a  neural  network  and  an  environment  model,  is  shown  schematically  in  Figure  2.  In  this  figure, 
we  use  the  traditional  term  plant  model  instead  of  environmental  model.  We  define  the  dynamic  system  of 
Figure  2  to  be  a  neural  network  controller-plant  system. 

Because  we  often  use  the  controller-plant  system  to  illustrate  important  points  throughout  this  paper,  it 
is  useful  at  this  point  to  discuss  this  system  in  greater  detail  The  plant  model  may  take  two  different  forms. 
The  most  general  is  simply  a  set  of  equations  which  map  previous  states  and  current  control  to  the  next 
state,  Yk  =  fp(uii,Yk),  where  Yt  are  the  previous  states  of  the  plant  and  ut  is  the  control  signal  vector. 
If  the  equations  of  the  model  are  nonlinear,  adapting  the  structure  of  Figure  2  results  in  a  neural  controller 
which  implements  a  nonlinear  state  feedback  control  law.  This  technique  provides  a  method  of  designing  a 
nonlinear  controller  for  a  nonlinear  plant. 

Using  the  second  form  of  a  plant  model,  a  neural  network  model,  has  applications  in  the  field  of  adaptive 
control  [3,5,8].  In  this  case,  the  plant  model  takes  the  form  y*  =  /p(ufc,  Y*,  W**),  where  is  the  weight 
vector  of  the  neural  network  model.  This  model  can  be  updated  on-line  using  plant  input-output  data. 
Because  the  algorithms  presented  in  this  paper  use  the  plant  model  to  update  the  controller,  the  neural 
network  controller  can  adapt  to  changes  in  the  plant. 

2.3  Ordered  Partial  Derivatives 

Because  we  are  interested  in  using  first-order  gradient  descent  to  update  the  adaptive  weights,  we  are  required 
to  calculate  a  partial  derivative  of  the  associated  dynamic  system.  The  ordered  partial  derivative,  which 
is  a  special  partial  derivative  for  an  ordered  set  of  equations,  provides  a  mathematical  technique  for  easily 
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finding  derivatives  of  complex  dynamic  systems  [6]. 

In  order  to  discuss  the  ordered  derivative,  we  must  first  introduce  the  concept  of  an  ordered  set  of 
equations.  Let  { , . . . ,  r* . . . . ,  Zj , . . . ,  Zn  }  be  a  set  of  n  variables  whose  values  are  determined  by  a  set  of  n 
equations.  This  set  of  equations  is  defined  to  be  an  ordered  set  of  equations  if  each  variable  r,-  is  a  function 

only  of  the  variables  {ri . Thus,  the  equation  for  any  variable  of  an  ordered  set  of  equations  can 

be  written  as 


-i  =  /(^i . 2x-l) 

Because  of  the  ordered  nature  of  this  set  of  equations,  the  variables  {^i,. . .  must  be  calculated  before 

Zi  can  be  computed. 

When  calculating  the  partial  derivative  of  a  variable  it  is  necessary  to  specify  which  variables  are  held 
constant  and  which  are  allowed  to  vary.  Typically,  if  this  is  not  specified,  it  is  assumed  that  all  variables 
are  held  constant  except  those  terms  appearing  in  the  denominator  of  the  partial  derivative.  This  is  the 
convention  we  have  adopted  in  this  paper. 

The  ordered  partial  derivative,  which  is  defined  only  for  variables  of  an  ordered  set  of  equation,  is  a  partial 
derivative  whose  constauit  and  varying  terms  are  determined  using  the  ordered  set  of  equations.  The  constant 
terms  of  the  order  partial  derivative  of  Zj  with  respect  to  Zi,  which  is  denoted  d'^Zj /dzi  in  order  to  distinguish 
it  from  an  ordinary  partial  derivative,  are  {zi, . . . ,  z<_i}.  The  varying  terms  are  {zi, . . .  ,Zj, . . . ,  z„}. 

The  ordered  derivative  is  usually  found  using  either  of  two  chain  rule  expansions.  The  first  expansion, 
which  is  expressed  in  vector  form  as 

dzi  dzi  ^  dzt  dzi 

was  shown  by  Werbos  in  his  thesis  (6).  The  proof  of  the  second  chain  rule  expansion 

dzi  “  5z<  ^  dzk  dzi  ^  ’ 

uses  arguments  similar  to  those  used  to  prove  the  first  expansion. 

Finally,  one  comment  on  mathematical  notation,  throughout  this  paper  it  is  assumed  that  a  partial 
derivative  of  the  form,  da/db,  where  a  €  and  b  G  is  a  matrix  of  the  form 


3  Epochwise  and  On-Line  Training 

Any  adaptive  algorithm  adjusts  the  parameters  of  a  system  so  that  the  system  responds  to  a  set  of  inputs  in 
some  desired  manner.  First-order  gradient  descent  algorithms  accomplish  this  goal  by  minimizing  an  error 
function.  The  definition  of  this  error  function  is  dependent  upon  whether  the  system  is  operating  in  an 
epochwise  or  on-line  mode.  In  this  section,  both  epochwise  and  on-line  training  are  defined.  The  epochwise 
and  on-line  error  functions  as  well  as  their  associated  weight  update  equations  are  also  presented. 

3.1  Epochwise  Training  Algorithms 

An  epoch  is  a  forward  iteration  of  the  dynamic  system  from  iteration  k  =  0  to  kj,  where  kj  is  the  final 
iteration.  An  epochwise  training  algorithm  is  any  algorithm  in  which  training  takes  place  after  each  epoch 
or  a  series  of  epochs  of  the  dynamic  system. 

In  order  to  use  the  gradient  descent  epochwise  algorithms,  an  error  must  be  defined.  It  is  common  for 
this  error  to  be  a  function  only  of  the  outputs  of  the  dynamic  system,  {yc . Yk,)-  and  a  set  of  desired 
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output  vectors,  {do . d*^},  with  d*  €  The  desired  output  vectors  and  the  set  of  inputs  vectors 

associated  with  the  desired  output  vectors  are  given  in  a  training  set.  This  set  is  composed  of  P  elements 

with  element  p  taking  the  form  {Rop,  Yop.rip, . . .  .r*^p.  dip _ ,dt^p}.  It  should  be  noted  that  the  desired 

response  need  not  be  defined  for  each  iteration,  only  for  the  final  iteration  k/.  A  commonly  used  epochwise 
error  function  is 


P  , 
p=ot=o  ^ 

which  is  the  sum  of  the  squared  error  over  the  entire  training  set.  The  error  of  Equation  4  is  calculated 
using  an  ordered  set  of  equations.  Because  of  the  ordering  of  this  set,  the  error  is  always  the  last  calculation 
performed  in  this  set. 

Utilizing  gradient  descent,  the  epochwise  algorithms  presented  in  this  paper  update  the  weights  using 

Q+E 

+  =  (5) 

where  p,  the  learning  rate,  is  a  suitably  chosen  positive  constant.  The  update  rule  generates  a  new  weight 
vector  W(i  +  1)  from  the  vector  W(i).  If  the  error  function  defined  in  Equation  4  is  used,  the  weights  are 
updated  after  cycling  through  the  training  set.  Therefore,  if  weight  vector  W(l)  is  used  for  the  first  cycle 
through  the  training  set,  weight  vector  W(i)  is  used  for  the  cycle  through  the  set.  Generally,  the  weight 
vector  of  the  first  cycle,  W(l),  is  randomly  initialized. 

Although  first-order  gradient  descent  provides  the  basis  for  adaptation  of  the  weights,  the  algorithms 
discussed  in  this  paper  have  come  to  be  known  by  the  method  for  which  they  calculate  the  error  gradient  of 
Equation  5.  Hence,  it  should  be  remembered,  that  even  though  backpropagation-through-time  and  recursive 
backpropagation,  two  algorithms  presented  in  Section  5,  have  quite  different  names,  they  both  perform 
first-order  gradient  descent. 

3.2  On-line  Training  Algorithms 

If  the  weight  update  of  an  algorithm  at  the  current  iteration  k'  depends  only  on  the  states  of  the  system 
at  iterations  {k',  k'  ~  l,k'  -2,.. .},  then  the  algorithm  is  defined  to  be  an  on-line  training  algorithm.  The 
implied  dependence  only  upon  the  current  and  past  values  of  the  system  allows  the  weight  updates  to  be 
computed  in  real-time  in  most  cases.  The  key  difference  between  on-line  and  epochwise  training  algorithms 
is  that  an  on-line  algorithm  adapts  the  weights  of  the  system  as  it  runs  while  an  epochwise  training  algorithm 
only  updates  the  weights  after  the  final  iteration.  The  primary  reason  for  using  an  on-line  algorithm  is  that 
as  the  number  of  iterations  in  an  epoch  becomes  very  large,  it  becomes  computationally  inefficient  to  update 
the  weights  only  after  each  epoch.  Therefore,  on-line  algorithms,  which  adapt  the  weights  as  the  system 
runs,  must  be  used. 

In  the  on-line  case,  an  error  is  defined  for  each  iteration.  At  the  current  forward  iteration,  fc',  the  error, 
Eie‘,  is  often  a  function  of  the  desired  response  vector,  dt<,  and  the  output  vector,  y*<.  It  is  common  to  use 
the  on-line  error  function 


Ek'  =  ^(d*.  -  yk'V(dk'  -  Vk')-  (6) 

It  is  well  known  that  in  the  on-line  case,  minimization  of  Equation  6  using  first-order  gradient  descent  at 
each  iteration  results  in  the  minimization  of  the  mean  square  error  [9];  therefore,  using  the  error  defined  by 
Equation  6  minimizes 


E[Ek'\  =  E[^(dk'-yk'/i^t‘-yk')\ 
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Figure  3:  Static  System. 


where  E  is  the  expected  value  operator. 

Often,  on-line  training  algorithms  update  the  weights  at  each  iteration  based  upon  the  gradient  of  the 
error  function.  At  iteration  k\  the  on-line  update  rule  is  expressed  as 


w(k'  -(-  1)  =  w(k')  -  n 


d+Ef 

dw{k>) 


(7) 


where  p  is  a  suitably  chosen  constant  and  £*<  is  the  appropriate  on-line  error  function.  Equation  7  is  usually 
initialized  by  a  random  setting  of  u;(0).  The  application  of  Equation  7  generates  a  new  vector  of  weights 
at  each  iteration.  When  using  an  on-line  algorithm,  it  is  common  for  the  weights  at  the  iteration  k'  to  be 
selected  from  the  vector  W(ib'),  therefore,  wic'{k')  =  w{k'). 


4  Static  System  Algorithms 

In  order  to  facilitate  the  discussion  of  the  training  algorithms  for  discrete-time  dynamic  systems,  it  is  useful 
to  introduce  the  first-order  gradient  descent  algorithms  for  static  systems.  A  static  system  contains  no 
feedback,  therefore,  a  static  system  has  the  structure  shown  in  Figure  3,  where  r  6  7il*^**l  is  the  input 
vector  and  y  €  is  the  output  vector.  A  static  system  can  be  described  by  the  following  equation 


y  =  /(r,W(i)). 

4.1  The  Backpropagation  Algorithm 

As  in  the  dynamic  system  case,  the  first-order  gradient  descent  techniques  for  static  systems  depends  upon 
minimizing  an  error.  In  general,  this  error  is  a  function  of  the  output,  eind  the  output  is  a  function  of  the 
weights.  Therefore,  using  the  chain  rule  of  Equation  2,  the  error  gradient  may  be  expressed  as 

d'^  E  _  E  dy  _  BE  dy 
dw{i)  dy  dw{i)  dy  dw{i) 

Assuming  that  the  appropriate  equations  for  the  error  and  the  output  are  available  and  differentiable,  an 
expression  for  the  error  gradient  with  respect  to  each  weight  can  be  found  by  differentiating  the  error  and 
output  equations. 

If  more  is  known  about  the  structure  of  the  system,  it  is  possible  to  use  this  information  to  decrease  the 
number  of  computations  needed  to  find  the  error  gradients.  The  backpropagation  algorithm  of  Rumelhart 
et  al  [4]  does  precisely  this.  Based  upon  the  fact  that  the  static  system  is  composed  of  a  layered  feedforward 
neural  network,  the  backpropagation  algorithm  efficiently  computes  the  error  gradients  for  such  a  static 
network.  Using  the  backpropagation  algorithm,  the  error  gradient 


d-^E  _  dy 
dw(i)  dw(i) 
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Figure  4:  Static  Neural  Network  Control  System. 


where 


A  = 


dy' 


is  calculated  by  backpropagating  the  vector,  A,  through  the  neural  network.  It  should  be  noted  that  any 
equation  which  takes  the  form  of  Equation  8,  can  be  calculated  using  backpropagation,  provided  the  structure 
of  the  system  is  a  neural  network. 


4.2  Static  Controller-Plant  System 

As  we  shall  next  show,  the  error  gradient  of  a  static  controller-plant  system  takes  the  form  of  Equation  8. 
In  static  neural  control  applications,  the  system  is  composed  of  two  components,  the  controller  and  plant, 
as  shown  in  Figure  4.  The  controller  is  implemented  by  a  multilayered  neural  network  while  the  plant  may 
be  modeled  by  a  neural  network  or  a  set  of  equations.  When  the  plant  is  modeled  using  a  neural  network, 
the  combination  of  the  controller  and  plant  form  a  static  neural  network.  Therefore,  the  error  gradient  with 
respect  to  each  weight  of  the  neural  controller  takes  the  form  of  Equation  8  and  can  be  calculated  using  the 
backpropagation  algorithm. 

If  the  plant  is  modeled  by  a  set  of  equations,  a  differentiation  of  the  plant  equations  with  respect  to  the 
control  vector  in  combination  with  a  backpropagation  can  be  used  to  compute  the  error  gradient.  The  error 
of  the  system  of  Figure  4  is  a  function  of  the  plant  output,  which  is  a  function  of  the  controller  output. 
The  controller  output  is  a  function  of  the  weights  of  the  network.  Therefore,  using  the  ch^n  rule,  the  error 
gradient  can  be  written  as 


d'^E  _  ,  5u 
dw{i)  dw{i) 


(9) 


where 


,  _  dE  dy 
dy  9u 

Equations  8  and  9  have  the  same  form.  Because  u  is  the  output  of  a  neural  network,  the  error  gradients  with 
respect  to  the  controller’s  weights  are  found  by  backpropagating  A'  through  the  neural  network  controller. 
The  backpropagation  term,  A',  is  computed  by  multiplying  dE/dy  by  the  plant  Jacobian  matrix,  dy/du, 
which  is  calculated  from  the  plant  equations. 
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4.3  Training  Neural  Networks  Implemented  on  Chip 

It  is  also  possible  to  train  a  static  feedforward  network  which  is  implemented  on  a  VLSI  chip  using  first-order 
gradient  descent.  In  this  case,  it  is  assumed  that  the  precise  mathematical  equations  of  the  neural  network 
system  are  not  known.  Therefore,  the  derivatives  cannot  be  calculated  mathematically.  Instead,  they  can 
computed  by  introducing  small  perturbations  into  the  hidden  layer  nodes.  The  resulting  perturbation  at 
the  output  divided  by  the  node  perturbation  approximates  the  output  gradient  with  respect  to  the  node. 
This  method,  in  conjunction  with  LMS  [9],  can  be  used  to  calculate  the  output  gradient  with  respect  to  the 
weights.  This  technique  of  calculating  the  gradient  is  known  as  Madaline  Rule  III  (MR  III)  [10].  A  VLSI 
chip  manufactured  by  Intel  supports  this  type  of  training  [11]. 

The  first-order  gradient  techniques  discussed  above  are  used  not  only  for  adapting  static  systems,  they 
are  also  a  key  component  of  the  rules  used  for  adapting  discrete-time  dynamic  systems.  It  is  shown  in  the 
next  section  that  these  techniques  are  required  for  training  dynamic  systems  composed  fully  or  partially  of 
neural  networks. 


5  Algorithms 

In  this  section  two  basic  types  of  algorithms,  Euler-Lagrange  based  algorithms  and  recursive  gradient  update 
algorithms,  both  of  which  are  used  for  training  discrete-time  dynamic  systems,  are  discussed.  The  epochwise 
and  on-line  versions  of  both  these  types  of  algorithms  are  presented.  As  indicated  by  their  name,  the 
Euler-Lagrange  algorithms  are  based  on  the  discrete-time  Euler-Lagrange  equations  [12],  These  equations 
are  used  to  calculate  the  error  gradient  with  respect  to  the  weights.  The  backpropagation-through-time 
algorithm,  which  is  used  for  epochwise  training  of  dynamic  neural  networks,  is  an  example  of  an  Euler- 
Lagrange  based  algorithm.  The  recursive  gradient  update  algorithm  is  based  upon  a  recursive  equation  for 
the  output  gradient  which  is  derived  from  the  dynamic  system  definition.  Equation  1.  The  error  gradient 
is  easily  computed  using  this  output  gradient.  The  recursive  backpropagation  algorithm,  which  is  used  for 
on-line  training  of  dynamic  neural  networks,  is  an  example  of  a  recursive  gradient  update  algorithm.  The 
epochwise  and  on-line  versions  of  edgorithms  based  upon  the  Euler-Lagrange  equations  and  the  recursive 
gradient  update  equation  can  be  used  to  train  a  variety  of  dynamic  systems  which  contaun  neural  networks 
as  will  be  shown  periodically  in  the  remainder  of  this  section. 

Before  deriving  the  Euler-Lagrange  and  recursive  gradient  update  algorithm,  it  is  worth  mentioning  that 
the  stability  of  these  algorithms  cannot  be  guaranteed.  Therefore,  when  using  these  algorithms,  one  must 
constantly  monitor  their  performance.  If  instability  becomes  a  problem  when  using  one  of  these  algorithms, 
it  is  often  necessary  to  change  certain  parameters  of  the  algorithm,  often  the  learning  rate,  to  overcome  the 
problem. 

For  any  given  discrete-time  dynamic  system  problem,  either  an  Euler-Lagrange  based  algorithm  or  a 
recursive  gradient  update  algorithm  can  be  used  to  train  the  system.  In  Section  6,  it  is  shown  that  the 
Euler-Lagrange  based  algorithm  and  recursive  gradient  update  algorithm  compute  approximately  the  same 
error  gradient  for  a  given  problem  in  both  the  epochwise  and  on-line  cjise.  Even  though  the  algorithms  are 
inherently  equivalent,  the  computational  and  storage  requirements  of  the  algorithms  are  different.  Therefore, 
the  .selection  of  the  appropriate  algorithm  for  a  specific  problem  should  be  based  upon  the  computational 
and  storage  requirements.  These  requirements  are  derived  in  Section  7. 


5.1  An  Algorithm  Based  on  the  Euler-Lagrange  Equations 

The  discrete-time  Euler-Lagrange  equations  in  the  calculus  of  variations  provide  a  standard  technique  for 
calculating  the  first-order  gradients  of  an  error  function.  Using  these  equations,  it  is  po.ssible  to  calculate  the 
epochwise  gradient  of  any  discrete-time  dynamic  system  provided  that  the  differentiability  requirements  on 
the  system,  discussed  in  Section  2,  are  met.  When  the  dynamic  system  is  compo.sed  fully  of  a  feedforward 
neural  network,  the  error  gradient  can  be  calculated  using  a  combination  of  the  Euler-Lagrange  equations  and 
the  backpropagation  algorithm.  This  combination  is  the  basis  of  the  backpropagation-through-time  algorithm 
whicli  w;us  first  iiitrodund  by  Werbos  [(>].  A  number  of  researchers  including  Nguyen  and  Widrow  [3], 
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Pearlnuuter  [2]  and  Jordan  [8]  have  successfully  used  the  backpropagation-through-time  algorithm  to  train 
dynamic  networks. 

In  this  section,  the  discrete-time  Euler-Lagrange  equations  are  first  derived.  Next,  an  epochwise  training 
algorithm  which  uses  these  equations  is  discussed.  Finally,  training  of  dynamic  systems  composed  fully  or 
partially  of  neural  networks  using  the  Euler-Lagrange  based  algorithm  is  presented. 


5.1.1  Disci’ete-Tiiiie  Euler-Lagrange  Equations 

In  order  to  use  first-order  gradient  descent,  we  need  to  find  the  error  gradient,  d* Eldu.'(i),  for  any  given 
epoch.  This  gradient  can  be  derived  using  the  first  chain  rule  expansion,  Equation  2,  and  the  following 
ordered  set  of  equations,  which  are  generated  at  each  epoch. 


Wo(.) 

= 

W(.) 

yo 

= 

/(Ro,Yo,Wo(i)) 

Wi(i) 

= 

W(i) 

yi 

/(Ri,Y,,Wi(.)) 

w*(0 

W(i) 

yt 

= 

/(Rt,Yk,W*(i)) 

= 

W(t) 

y*/ 

= 

/(R*„Yt,.Wt,(.)) 

E 

= 

fiyo,y\,--,ykf,do,d 

Using  the  first  chain  rule  expansion  for  an  ordered  system.  Equation  2,  we  can  expand  the  ordered  derivative, 
Eldw{i),  to  obtain 


d^E  ^  dE  lx  fd^E  dyt  d+E  dWt{i)\  , 

9u;(»)  dw(i)  \  5yt  5u;(j)  5Wt(«)  dw{i)  ) 

The  terms  dE/dw(i)  and  dyk/dw(i)  are  equal  to  zero  because  E  and  y*  are  not  a  direct  function  of  w{i)  *. 
Thus,  we  find  the  expansion  of  Equation  10  can  be  written  as 


d^E  d^E  dW,{i)  X  d^E 

dw{i)  dw{i) 

We  need  to  find  an  expression  for  the  term  E/dwt{i).  This  expression  can  be  found  by  expanding  the 
ordered  derivative  using  Equation  2. 


dwt(i) 


I  ^  ^  I  ^ 

«'*(»)  dyj  dwk{i) 


d+E  dV/jii) 
dWj{i)  dwt(i)  ■ 


'  In  our  definition  of  the  partial  derivative,  all  terms  are  held  constant  except  the  terms  in  the  denominator  of  the  partial 
derivative.  Therefore,  if  the  function  which  defines  the  numerator  of  the  partial  derivative  does  not  contain  the  terms  of  the 
denonunator  directly,  then  the  partial  derivative  is  zero. 
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The  terms  dE/dwt^i)  and  dW j(i)/dwk{i)  are  equai  to  zero.  The  term  dyj /dwk(i)  is  nonzero  only  when 
k  =  j.  Using  these  results,  we  find  the  ordered  derivative,  d* E/dwk{i),  to  be 


d*  E  _  d'^  E  dyk 
dwk(i)  ~  dyk  dwk(i)' 

Substituting  Equation  12  into  Equation  11,  the  error  gradient  is 


(12) 


d'^'E  _  ^  d'^E  dyk  _  ^  .  dyk 

dw{i)  dwkii)  ~  dwk{i) 


where 


^k 


d+E 
dyk  ' 


The  term  dyk/dwk(i)  of  Equation  13  is  easy  to  calculate.  The  term  d'^E/dyk  must  still  be  expanded. 
Once  again,  using  the  chain  rule  expansion,  Equation  2,  we  expand,  d'*'  E/dyk,  to  find 


dyk  dyk^  .^^^\dyj  dyk^  dWjii)  dyk  )' 


(14) 


The  term  dWj(i)/dyk  is  equal  to  zero.  The  term  dyj/dyk  is  also  equal  to  zero  when  j  >  k  -j-  L,  where  L 
is  the  maximum  number  of  delays  in  the  feedback  of  dynamic  system.  Using  these  results,  Equation  14  can 
be  written  as 


A* 


dE  ^  d^E  dyk+i 

dyk  ^  dyk+j  dyk 


ffc  + 


dyk+i 

dyk 


where 


(15) 

(16) 


(k  = 


dE 

dyk' 


Equation  16  is  a  backward  difference  equation  which  can  be  solved  using  the  following  boundary  conditions 


A*/  =  ik, 

Afc,  +  i, . .  . ,  Afc^+t  =  0. 


(17) 

(18) 


We  shall  refer  to  equations  16,  17  and  18  as  the  Euler- Lagrange  gradient  equations.  These  gradient 
equations  along  with 
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dw{i) 


form  the  discrete-time  Euler-Lagrange  equations.  Equation  19  guarantees  that  a  solution  of  these  equations 
results  in  either  a  minimum  or  maximum.  It  usually  is  not  possible  to  find  the  analytic  solution  of  these 
equations.  Instead,  numerical  methods,  such  as  first-order  gradient  descent,  are  used  to  search  for  an 
appro.ximate  solution. 

5.1.2  Implementation  of  the  Algorithm 

Having  derived  the  Euler-Lagrange  equations,  we  introduce  the  Euler-Lagrange  based  algorithm  which  is  used 
to  calculated  the  error  gradient  with  respect  to  the  weights  for  an  epoch.  First,  the  discrete-time  dynamic 
equation  of  the  system.  Equation  1,  is  iterated  forward  in  time  from  iteration  0  to  kj.  An  appropriately 
selected  training  set  element  is  used  to  supply  the  boundary  conditions,  Ro,ro, . . .  ,rt^  and  Yq,  for  the 
forward  iteration  of  Equation  1.  Next,  the  error  gradients  with  respect  to  the  outputs.  At,  are  calculated  by 
backward  iterating  Equation  16  using  the  boundary  conditions  of  equations  17  and  18.  Finally,  the  results 
of  the  backward  sweep  are  used  to  compute  the  error  gradient.  Equation  13. 

Tlie  technique  for  calculating  the  error  gradient  presented  above  is  independent  of  the  dynamic  system. 
As  long  as  the  output  gradient  with  respect  to  the  weights  and  recurrent  inputs  exists,  the  epochwise  error 
gradient  of  a  discrete-time  dynamic  system  with  adaptive  parameters  may  always  be  computed  using  this 
method. 

5.1.3  Training  Dynamic  Systems  Composed  of  Neural  Networks 

If  the  system  function,  /,  is  implemented  by  a  feedforward  neural  network,  an  interesting  observation  can  be 
made.  The  error  gradient  summation  computation,  equation  13,  contains  terms  of  the  form  At9yt/5«;t(i). 
It  is  shown  in  Section  3.2  that  terms  of  this  form  can  be  calculated  using  a  backpropagation  provide  that  yt 
is  the  output  of  a  neural  network.  Therefore,  in  the  dynamic  network  case,  each  term  of  Equation  13  can 
be  computed  by  backpropagating  the  vector  A*  through  the  dynamic  network.  Furthermore,  the  summation 
terms  of  the  b^u:kward  sweep  calculation.  Equation  16,  are  of  the  form  ^k+jdyk+j/dyk-  Once  again,  in 
the  dynamic  network  case,  these  terms  may  be  calculated  using  the  backpropagation  algorithm.  Although 
it  may  seem  that  a  large  number  of  backpropagations  are  required  for  each  epoch  to  calculate  the  error 
gradient,  only  kf  +  I  backpropagations  are  needed.  By  backpropagation  of  the  vector  At  at  each  iteration 
of  the  backward  sweep  and  by  storing  the  results  of  this  backpropagation  in  memory,  the  minimum  number 
of  backpropagations  can  be  achieved.  This  technique  of  calculating  the  error  gradient  is  known  as  the 
backpropagat ion- 1 h rough- time  algori thm . 

Neural  network  controllers  can  be  designed  using  the  Euler-Lagrange  based  algorithm.  Systems  with 
the  neural  network  plant  model  may  be  trained  using  the  backpropagation-through-time  algorithm.  If  the 
plant  model  consists  of  a  set  of  equations,  the  summation  terms  of  both  the  backward  sweep  calculation, 
Equation  16,  and  error  gradient  computation.  Equation  13,  may  be  calculated  by  bstckpropagating  the  vector 
Aj  =  Xkdyk/duk  at  each  iteration  of  the  backward  sweep. 

Finally,  a  combination  of  the  Euler-Lagrange  based  algorithm  and  the  MRJII  algorithm  can  be  used  to 
train  a  discrete-time  neural  network  which  is  implement  on  a  VLSI  chip.  The  summation  terms  of  both  the 
backward  sweep  calculation.  Equation  16,  and  the  error  gradient  summation  computation.  Equation  13,  can 
be  computed  using  the  MRIII  algorithm  at  each  backward  iteration  of  the  Euler-Lagrange  based  algorithm. 


5.2  An  On-Line  Algorithm  Based  on  the  Euler-Lagrange  Equations 

The  Euler-Lagrange  based  algorithm  is  an  epochwise  training  technique.  In  many  applications,  such  as 
real-time  filtering  and  adaptive  control,  it  is  necessary  to  allow  on-line  training.  In  these  cases,  an  on-line 
version  of  the  Euler-Lagrange  based  algorithm,  which  is  introduced  in  this  section,  can  be  used. 
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Generally,  in  the  on-line  case,  at  each  forward  iteration  of  the  dynamic  system,  k',  the  error  gradient  is 
first  calculated  and  the  weights  are  updated  based  upon  this  calculation.  Using  the  results  of  the  previous 
section,  the  error  gradient  could  be  calculated  by  iterating  Equation  16,  which  is  repeated  here, 


backwards  through  time  from  iteration  k'  to  0.  Because  it  is  common  to  use  the  mean  squared  error  in  the 
on-line  case,  the  boundary  conditions  of  Equation  19  would  take  the  form 


(20) 

(21) 

(22) 

Finally,  the  results  of  the  backward  sweep  are  summed  to  produce  the  on-line  error  gradient  using 


Xk‘  =  — (dt*-yt<) 

■^*'+1 . Xk>+L  =  0 

foi  •  •  •  1  e*,  •  •  •  if*'-!  =  0. 


Ek‘  _  d'^Ek'  _  «  dyjk 

dwik>)  -  dwk{k>)  -  ^  ^dwk{k') 

which  is  similar  to  Equation  13  of  the  previous  section. 

5.2.1  Problems  Associated  with  On-Line  Implementation 

Two  problems  arise  when  one  attempts  to  determine  the  on-line  error  gradient  of  a  dynamic  system  in  this 
manner.  First,  the  number  of  iterations,  k',  for  most  on-line  applications  quickly  becomes  large.  Because  the 
error  gradient  is  calculated  by  iterating  a  difference  equation  backwards  from  k'  to  0,  the  number  of  com¬ 
putations  required  to  calculate  the  gradient  grows  linearly  with  k'.  Obviously,  this  technique  of  calculating 
the  gradient  quickly  becomes  computationally  expensive.  Instead  of  using  an  algorithm  whose  computations 
grows  linearly  with  the  current  iteration  count,  it  is  better  to  use  an  algorithm  whose  computations  remain 
constant  and  are  independent  of  the  current  iteration  count.  This  can  be  accomplished  by  iterating  Equa¬ 
tion  19  backwards  through  time  a  constant,  T,  number  of  iterations.  Using  this  idea,  the  error  gradient  is 
calculated  by  first  iterating  Equation  16  backwards  in  time  from  iteration  k'  to  k'  -T  using  the  appropriate 
boundary  conditions.  After  this  computation,  the  error  gradient  is  calculated  using 


d'*'Ek>  ^  ^  d*Ek>  _  .  dytc 

dw{k')^  ^^^_^dwk{k')  ^  dwkik'Y  ^  ’ 

Of  course,  the  error  gradient  computed  using  this  method  is  an  approximation  of  the  true  gradient.  An 
example  will  illustrate  the  nature  of  this  approximation.  Figure  5,  shows  the  values  of  d'^ Ek' /dtvk(k')  for 
some  given  dynamic  system.  By  summing  only  a  portion  of  these  terms,  it  can  be  observed  in  Figure  6  that 
the  resulting  approximate  error  gradient  is  a  windowed  version  of  the  true  gradient.  Thus,  the  validity  of 
the  approximate  gradient  depends  upon  how  much  of  the  grtuiient  lies  outside  of  the  window  of  length  T. 

The  second  problem  with  directly  using  the  Euler- Lagrange  equations  in  the  on-line  case  results  from  the 
weight  changes  at  each  iteration.  In  the  on-line  case,  the  system  difference  equation  is 


yt  =  /(R*,Yfc.Wt(F)) 


(2-1) 
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Figure  5:  Error  Gradients  of  a  Typical  System. 

where  the  weight  vector,  Wjt,  changes  at  each  iteration  according  to  Equation  7.  The  on-line  error  gradient, 
dEk' /dw{k'),  is  defined  with  respect  solely  to  the  4'**  weight  vector.  However,  this  weight  vector  is  only 
used  at  iteration  4'.  Because  the  weight  w(k‘)  only  appears  at  iteration  4',  one  could  conclude  incorrectly 
that  the  error  gradient  could  be  calculated  based  solely  upon  iteration  k'.  This  is  incorrect  because  the 
weight  vectors  are  related  by  Equation  7.  In  fact,  in  most  cases  of  interest,  the  weights  change  slowly  from 
iteration  to  iteration  because  the  learning  rate,  fx,  of  Equation  7  is  small.  Under  this  condition,'we  find 


W(4)  as  W(4  -  1). 


Using  this  approximation,  the  on-line  error  gradient  summation  of  Equation  23,  can  be  written  as 


^  ^  d^Ek'  _  »  dyic  -2  . 

drvik')  dwuik)  -  dw,ik) 

where  Wk{k)  has  replaced  Wk(k')  in  the  partial  derivative  dyn/dwkik).  The  combination  of  Equation  25  and 
a  b2u:kward  sweep  of  length  T  -f- 1  allows  the  calculation  of  an  approximate  error  gradient. 

The  error  gradient  approximation  can  be  improved  by  exponentially  weighting  the  terms  of  Equation  25. 
Because  in  most  cases  the  weight  change  between  w{k)  and  11^(4')  tends  to  become  larger  as  4  is  decreased 
starting  from  4',  it  can  be  argued  that  the  approximation  dEk'/dwk{k)  ft;  dEk'/dwic{k')  becomes  less  valid 
as  4  decreases.  Under  this  assumption,  when  calculating  the  error  gradient,  the  influence  of  the  less  accurate 
terms  of  Equation  25  should  proportionally  be  reduced.  This  can  be  accomplished  using  the  following 
exponential  weighting  scheme 


dEk'  ^  ^  k'-k  d'^Ek'  _  ^y* 

^dwk(k)' 


(26) 
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where  the  constant  0.0  <  a  <  1.0  is  the  weighting  coefRcient.  The  addition  of  exponential  weighting  causes 
the  window  to  have  the  form  shown  in  Figure  7.  It  should  be  noted,  that  exponential  weighting  may  be 
desirable  even  in  the  epochwise  training  case  if  the  backward  sweep  computation  of  Equation  16  is  unstable. 

5.2.2  Implementation  of  the  Algorithm 

Having  derived  approximate  solutions  for  the  two  problems  of  implementing  the  Euler- Lagrange  equations 
on-line,  we  can  now  present  the  on-line  Euler-Lagrange  based  algorithm.  The  algorithm  is  based  upon  the 
following  sequences  being  performed  at  each  iteration  k':  a  forward  propagation  based  on  Equation  24,  a 
backward  sweep  of  length  T-\-\  using  Equation  19  and  the  boundary  conditions  of  equations  20,  21,  and  22, 
a  calculation  of  the  error  gradient.  Equation  26,  and  an  update  of  the  weights  based  upon  Equation  7. 

For  a  dynamic  system  with  Q  weights,  the  on-line  algorithm  outlined  above  requires  that  (T-f  1)Q  weight 
vectors  be  stored.  In  most  cases,  this  is  an  impractical  amount  of  storage.  Under  these  circumstances,  an 
approximate  output  vector  of  the  form 


yt«/(R*,Yt,W*(ib'))  (27) 

may  be  used  in  Equation  19,  the  backward  sweep  equation,  instead  of  the  output  vector  defined  in  Equa¬ 
tion  24.  Equation  27  is  a  function  of  only  one  weight  vector,  W(lt'),  therefore,  only  this  vector  needs  to  be 
stored.  Using  the  approximation  of  Equation  27  results  in  a  good  approximation  of  dEii'/dwif{k')  provided 
that  the  weights  change  slowly. 

In  the  on-line  case,  the  summation  terms  of  both  the  backward  sweep  calculation,  Equation  19,  and  the 
error  gradient  computation,  Equation  26,  may  be  calculated  in  exactly  the  same  manner  as  in  the  epochwise 
case,  which  is  discussed  in  the  last  part  of  Section  5.1.  Therefore,  dynamic  systems  composed  of  a  neural 
network  or  neural  controller-plant  system  may  be  trained  on-line.  The  on-line  Euler-Lagrange  algorithm  can 
be  used  in  conjunction  with  the  backpropagation  algorithm  to  calculate  the  error  gradient  of  a  dynamic  neural 
network.  This  technique  of  calculating  the  on-line  error  is  known  as  on-line  backpropagation-through-time. 
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Figure  7:  Exponentially  Windowed  Error  Gradients. 

One  difficulty  associated  with  the  on-line  Euler-Lagrange  based  algorithm  is  the  selection  of  appropriate 
values  for  the  constants  T  and  a.  Like  the  selection  of  the  learning  rate,  fi,  there  are  no  analytic  procedures 
for  choosing  these  constants.  Instead,  the  selection  should  be  based  upon  knowledge  of  the  dynamic  system, 
desired  convergence  rate  and  required  misadjustment  upon  convergence. 

5.3  Recursive  Gradient  Update  Algorithm 

The  recursive  gradient  update  algorithm  provides  yet  another  method  for  adapting  a  discrete-time  dynamic 
system  composed  fully  or  partially  of  a  neural  network.  The  earliest  version  of  the  algorithm,  which  adapted 
a  single  linear  node,  was  introduced  by  White  in  1975  [13].  The  algorithm  received  much  attention  during 
the  later  70’s  in  the  adaptive  signal  processing  community.  It  was  found  to  suffer  from  stability  problems 
and  much  of  the  recent  research  has  been  dedicated  to  overcoming  this  problem  [14].  The  recursive  gradient 
update  algorithm  for  nonlinear  networks  became  well  known  because  of  a  paper  by  Williams  and  Zipser  [1]. 
Although  this  paper  dealt  only  with  single  layer  nonlinear  networks,  their  version  of  the  recursive  algorithm 
could  be  generalized  to  multilayered  networks  by  appropriatately  selecting  the  connections  between  input 
and  output.  In  this  section,  the  recursive  algorithm  for  a  general  system,  which  may  be  composed  partially 
or  fully  of  a  neural  network,  is  presented. 

In  both  the  epochwise  and  on-line  cases,  the  recursive  algorithm  utilizes  first-order  gradient  descent  to 
minimize  an  appropriate  error  function.  The  difference  between  the  Euler-Lagrange  based  algorithm  and  the 
recursive  algorithm  is  the  method  in  which  the  error  gradient  with  respect  to  the  weights  is  calculated.  To 
gain  an  understand  of  the  difference  between  the  algorithms,  the  epochwise  version  of  recursive  algorithm  is 
developed. 

5.3.1  Epochwise  Error  Gradient 

The  Euler-Lagrange  based  algorithm  is  derived  using  the  first  chain  rule  expansion.  Equation  2.  In  this  sec¬ 
tion,  the  second  chain  rule  expansion.  Equation  3,  is  used  to  derive  the  recursive  gradient  update  algorithm. 
Once  again,  we  assume  the  error,  E,  is  calculated  using  the  ordered  set  of  equations  shown  in  Section  5.1.1. 
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We  begin  by  expanding  the  error  gradient  using  :  ii?  oni  chain  rule  expansion,  Equation  3,  to  obtain  the 
following  result. 


^  dE  ^  fdE  d+yk  be 

dw(i)  dw(i)  ^  dw(i)  ^  5Wt(»)  dw(i)  ) 

The  two  terms,  dE/dw{i)  and  dE/dVf are  equal  to  zero  because  the  error,  E,  is  not  a  direct  function 
of  auid  Wfc(i).  Therefore,  Equation  28  can  be  written  as 


d^E  BE  d+Yk 

Btt){i)  ~  Byt  Bw(i)' 


(29) 


The  first  term  of  Equation  29,  BEjByk,  is  easy  to  compute.  The  second  term,  d‘^yklBw{i),  can  he  found 
by  using  the  .second  chain  rule  expansion  once  more. 


B^y^  ^  Byu  ^  By,  B^W^ji)  ^  dyt  B+y^ 

dw(i)  dw(i)  ^  BWj(i)  dw(i)  ^  dyj  Bw(i)  ^ 

The  term  Byk/Bw{i)  equals  zero.  The  term,  Byk/dVi j{i),  of  the  first  summation  is  nonzero  only  when 
k  =  j,  therefore,  this  summation  only  contains  one  nonzero  term.  Furthermore,  it  is  easy  to  verify  that  this 
term  can  be  simplified  to  Byk/Bwk(i)-  Finally,  the  first  term  of  the  second  summation,  Byk/Byj,  is  nonzero 
only  when  k  —  j  <  L.  Using  these  results,  Equation  30,  can  be  written  as 


B'*’yk  _  Byk  ^  Byk  B+yk-j 

dw(i)  Bwk(i)  ~  Byk-j  dw(i) 

The  summation  in  Equation  31  can  be  eliminated  using  Y*  =  [yt-i, . . .  .yt-i]^ 


B-*-yk  _  Byk  Byk  B-*-Yk 

Bw{i)  Bwk(i)  dYk  Bw(i)  ’  ^ 

This  equation  can  be  used  to  recursively  calculate  the  output  gradients  for  the  entire  epoch.  It  is  usually 
initialized  using 


B+Yq 

Bw(i) 


=  0. 


(33) 


5.3.2  Implementation  of  the  Algorithm 

The  epochwise  error  gradient  can  be  calculated  by  first  using  Equation  32  to  determine  B'*’yk/Bw(i)  for  each 
iteration  of  the  epoch.  Once  these  ordered  derivatives  are  determined,  the  error  gradient  can  be  calculated 
using  Equation  29.  In  order  to  calculate  the  error  gradient  using  this  method,  the  output  gradient  at  each 
iteration  of  the  epoch  must  be  available.  If  the  outputs  are  stored  in  memory,  a  total  of  (t/  +  1)NQ  memory 
slots  are  required,  where  <V  is  the  number  of  outputs  and  Q  is  the  number  of  weights.  In  many  cases,  this 
amount  of  memory  may  not  be  available.  The  memory  requirements  may  be  reduced  to  (L  +  l)A  Q  using  a 
recursive  calculation.  Let  Sk,  an  intermediate  error  gradient  sum,  be  defined  as 
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Using  the  recursive  equation 


dE  d'^yj 
dyj  5u;(i)  ’ 


5t+i  =  + 


dE  d-^yt+i 
dyt+i  dw(i) 


(34) 


which  is  initialized  by 


_  dE  d-^yo 
°  dyo  dw(i) 


(35) 


it  follows 


d+E 

dw(i) 


=  S*,. 


The  epochwise  recursive  update  algorithm  is  implemented  using  the  equations  derived  immediately  above 
to  calculate  the  error  gradient.  For  a  given  epoch,  the  feedback  input  gradient,  d*Yo/dw{i),  is  initialized 
using  equation  33  while  the  intermediate  error  gradient  sum,  Sk,  is  initialized  using  Equation  35.  At  each 
iteration,  from  the  initial  iteration  k  =  0  to  the  final  iteration  k  =  kf,  the  following  sequence  is  performed; 
the  system  is  forward  propagated  using  the  function  /  defined  in  Equation  1,  the  output  gradient  is  conv 
puted  using  the  recursive  calculation  of  Equation  32  and  the  intermediate  error  gradient  is  updated  using 
Equation  34.  Because  St  and  d*yk/dw{i)  are  calculated  at  each  iteration,  only  the  previous  L  output  gra¬ 
dients  need  to  be  stored  in  memory,  therefore,  the  storage  requirements  of  the  algorithm  are  approximately 
{L  +  \)NQ.  After  the  final  iteration,  the  error  gradient  is  available  as  Skf 

In  order  to  update  the  output  gradient  at  each  iteration  using  Equation  32,  it  is  necessary  to  compute 
the  direct  output  gradient,  dyk/dwk{i),  and  the  Jacobian  matrix,  dyk/dYk-  Many  different  techniques 
can  be  used  to  calculate  these  terms  depending  upon  the  form  of  the  dynamic  system.  As  shown  below,  if 
the  structure  of  the  discrete-time  dynamic  system  is  a  neural  network,  these  two  components  can  be  found 
using  N  backpropagations  of  N  appropriately  selected  vectors  {Aj, . . . ,  A„, . . . ,  A;v}.  These  vectors  are  all 
backpropagated  through  the  k*^  iteration  of  the  dynamic  network. 

The  N  vectors  take  the  form 


A  _  /  1  if  n  =  ;■ 

\  0  otherwise 

where  A„^  is  the  element  of  the  row  vector  A„  €  Each  vector  has  only  one  nonzero  component  in 

the  n  =  j  column.  Because  y*  is  the  output  vector  of  a  neural  network,  the  backpropagation  of  the  vector 
X„  through  the  network  at  iteration  k  results  in  the  calculation  of  the  output  gradient  of  the  output,  as 
indicated  by 


.  dyt  _  dyk(n) 
'*5u)t(i)  dwkli)' 


Furthermore,  as  shown  by. 


17 


.  5yt  _  5yt(n) 

"  dYk  dYt 


backpropagating  the  vector,  A„,  back  to  the  input  nodes  results  in  the  calculation  of  the  row  of  the 
Jacobian  matrix.  This  shows  that  the  direct  output  gradient,  dyt/du:jt{i),  and  Jacobian  matrix,  Oyk/dYk- 
can  be  computed  using  .V  backpropagations  through  the  t'*  iteration  of  the  dynamic  network.  Using  this 
technique  in  the  epochwise  recursive  gradient  update  algorithm  shall  be  referred  to  as  epochwise  recursive 
back  propagation. 

The  computation  of  the  direct  output  gradient  and  the  Jacobian  matrix  for  a  neural  controller-plant 
system  is  similar  to  the  calculation  for  the  dynamic  neural  network.  In  fact,  if  the  plant  model  is  a  neural 
network,  the  technique  discussed  above  can  be  used  directly.  If  the  plant  model  is  based  upon  a  set  of  equa¬ 
tions,  once  again,  the  direct  output  gradient  and  Jacobian  matrix  can  be  computed  using  N  backpropagations 
through  the  neural  network.  In  this  case,  the  backpropagated  vectors  are  of  the  form 


dui 

The  back  propagation  of  A'  through  the  neural  controller  of  iteration  k  results  in  the  computation  of  the 
direct  output  gradient  as  indicated  by 


y  dyk  _  dykjn)  duk  _  gyt(n) 

"5tD*(i)  5u*  5ti;t(i)  dwk(i)' 

Similarly,  it  can  be  shown  that  the  n‘*  row  of  the  Jacobian  matrix  can  be  computed  using  the  backpropagation 
of  AJ,  to  the  inputs  of  the  neural  controller. 

If  the  system  is  composed  of  a  neural  network  implemented  in  hardware,  the  two  terms  dyk/dwk(i)  and 
dyk/dYk  of  Equation  32  can  be  computed  using  MRJII.  The  MRIII  algorithm  uses  the  output  gradient  to 
calculate  the  error  gradient.  Therefore,  this  algorithm  can  be  used  without  modification  to  find  the  two 
terms  of  Equation  32. 

The  epochwise  recursive  algorithm  can  be  used  to  calculate  the  epochwise  error  gradient  of  any  discrete¬ 
time  dynamic  system.  In  the  next  section,  we  show  that  this  algorithm  is  easily  extended  to  the  on-line 
case. 


5.4  On-Line  Recursive  Gradient  Update  Algorithm 

The  on-line  version  of  the  recursive  gradient  update  algorithm  is  easily  derived  from  the  epochwise  version. 
The  calculation  of  the  output  gradient  at  each  iteration  performed  by  the  epochwise  version  of  the  algo¬ 
rithm  depends  only  upon  the  current  and  past  values  of  the  dynamic  system.  The  lack  of  dependence  on 
future  values  of  the  network  in  cadculating  the  output  gradient  makes  the  algorithm  attractive  for  on-line 
i  mplementations . 

In  the  on-line  case,  the  mean  squared  error  is  often  minimized  by  updating  the  weights  at  each  iteration 
based  upon  an  error  gradient  of  the  form 


dEk> 

dw(k') 


dEk'  d'^’yk' 
dyk'  dw(k') 


-(dk'  -  yk'f 


d'^'yk’ 

dw{k')' 


(36) 

(37) 
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Tlie  second  term  of  K'niation  37  may  be  calculated  using  the  recursive  gradient  update  calculation,  Enua- 
tion  32,  which  was  derived  in  the  previous  section.  Using  this  approach,  the  squared  error  is  reduced  at  each 
iteration,  and  the  mean  squared  error  is  approximately  minimized. 

The  calculation  of  the  output  gradient  using  the  recursive  update  computation, 


_  dyk'  dyt'  d+Yf 

dw{k')  dvuk-ik’)'^  dYk' dw(k>)  ’ 

is  based  upon  the  assumption  that  the  weights  are  constant.  However,  in  the  on-line  case,  the  weights  change 
from  iteration  to  iteration  according  to  Equation  7.  In  order  to  use  the  recursive  equation,  it  is  assumed 
that  the  weights  change  slowly  or  equivalently 


W(Jb)  w  W(Jfc  -  1). 


This  assumption  allows 


d'*'Yk'  _  d-^Yk'-i^ 

dw(k')  dw(k')  '  dw(k')  ’  ’  dw(k') 

of  Equation  38  to  be  approximated  as 


d'*‘Yk'  ^  d'*'yk‘-i  ^  d'*’yk‘-7  ^  d'^yk‘-L  ^ 

dw{k') dw{k' —  1)  'dw{k'  —  2)  ’ '  ' '  dw{k' —  L) 

Because  of  this  approximation,  the  on-line  error  gradient  of  Equation  37  can  only  be  approximately  calcu¬ 
lated. 

In  a  manner  similar  to  the  on-line  Euler- Lagrange  based  algorithm,  the  accuracy  of  the  error  gradient 
approximation  can  be  improved  by  introducing  exponential  weighting.  The  exponentially  weighted  output 
gradient  can  be  calculated  using 


dyk‘  ^  dyk-  dyk' 

dw{k')  ^  dwk'{k')  dYk'  dw{k') 


(39) 


where  the  diagonal  matrix  F  €  is  used  to  perform  the  exponential  weighting.  In  general,  F  takes 

the  form 


where 


F  = 


0  0  .  .  0 

0  ^2  0  .  . 

0  .  .  . 

.  .  .  .  0 

0  .  0  <t>^ 


o  = 


o  0  .  .  0 

0  Q  0  .  . 

.  0  .  ,  . 

.  .  .  .  0 

0  .  .  0  a 


(10) 
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with  c  G  and  0.0  <  q  <  1.0. 

The  on-line  recursive  algorithm  with  exponential  weighting  is  easily  implemented.  The  f.*edback  input 
gradient,  which  is  used  in  the  recursive  gradient  update  equation,  is  initialize  as  indicated  by  Equation  153. 
The  elements  of  the  first  weight  vector,  W(0),  are  randomly  initialized.  .\t  each  iteration,  the  following 
sequence  is  performed:  a  forward  propagation  implemented  by  Equation  24.  a  recursive  upd.ate  of  the 
output  gradient  by  Equation  39.  a  calculation  of  the  error  gradient  which  utilizes  Eipiation  37  and  an  update 
of  the  weights  using  Equation  7. 

The  on-line  recursive  algorithm  can  be  used  to  adapt  the  weights  of  a  dynamic  neural  network  or  neural 
controller-plant  system.  As  discussed  in  the  previous  section,  in  these  cases,  the  two  terms,  clyt'/tluffl’') 
and  dyt'/dYk',  can  be  calculated  using  .V  backpropagations.  The  combination  of  the  on-line  recursive 
algorithm  and  the  backpropagation  algorithm  to  adapt  a  dynamic  network  shall  be  referred  to  as  the  on-line 
recursive  backpropagation  algorithm.  In  addition,  a  neural  netw'ork  which  is  implemented  on  a  VLSI  chip 
can  be  adapted  on-line  using  a  combination  of  the  on-line  recursive  algorithm  and  the  MRIII  algorithm.  The 
on-line  recursive  algorithm  provides  an  alternative  technique  to  the  on-line  Euler- Lagrange  based  algorithm. 
A  comparison  of  these  algorithms  is  presented  in  the  Section  7.  This  comparison  will  allow  proper  s<-lection 
of  an  on-line  algorithm  for  a  given  problem. 


6  Comparison  of  the  Algorithms 

Although  the  Euler-Lagrange  based  algorithm  and  recursive  gradient  update  algorithm  may  appear  to  be 
quite  different,  they  both  perform  first-order  gradient  descent  in  weight  space.  In  the  epochwise  case  where 
no  approximations  are  required  to  calculate  the  error  gradient,  the  two  algorithms  are  equivalent.  In  this 
case,  because  the  selection  of  algorithm  does  not  affect  the  convergence  rate,  the  algorithm  should  be  chosen 
on  the  basis  of  computational  complexity  and  storage  requirements  which  are  discussed  in  Section  7. 

In  the  on-line  case,  the  two  algorithms  result  in  approximately  identical  weight  updates  given  the  same 
set  of  inputs.  This  can  be  shown  using  the  formulation  of  Section  5.  The  exponentially  weighted  on-line 
error  gradient  of  the  Euler-Lagrange  based  algorithm,  shown  in  Equation  26,  is  repeated  here  for  comparison 
with  the  on-line  gradient  calculated  by  the  recursive  algorithm. 


au;(fc')  ~  dwk{k)' 


An  equation  similar  to  Equation  41  can  be  derived  for  the  on-line  recursive  algorithm.  Using  induction,  it  can 
be  proved  that  the  on-line  recursive  output  gradient  calculation  of  Section  5.4,  Equation  39,  is  approximated 
as 


a+yt,  _  dyic’  dyk'  d-*-Yk'  _  t'-t 
dw(k')  ~  dwk‘(k')  dYk'  dw{k')  dwk{k) 

Substituting  this  result  into  the  error  gradient  calculation  of  the  on-line  recursive  algorithm.  Equation  36, 
the  weighted  on-line  error  gradient  calculated  by  the  recursive  algorithm  is 


d'*'Ek-  ^  dEk'  k'-k  d'*'yk’  _  (.p2) 

dw(k')  dyk'  dwk(k)  ^  dwk(k)' 

The  two  on-line  techniques  are  approximately  equivalent  when  equations  41  and  42  are  approximately  equal. 
This  occurs  when 
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(43) 


k'-T-l 


E 


t'-k 

du’k(k) 


ss  0 


By  appropriate  selection  of  a,  one  can  guarantee  that  Equation  43  is  made  arbitrarily  close  to  zero.  Thus, 
we  can  conclude  that  the  two  on-line  algorithms  are  approximately  equivalent  and  that  the  validity  of  this 
approximation  depends  upon  satisfying  Equation  43. 


7  Computational  Complexity  and  Storage  Requirements 

If  the  two  algorithms  are  equivalent  in  both  the  epochwise  and  on-line  cases,  then  how  does  one  choose  which 
algorithm  to  use?  Obviously,  the  choice  should  be  based  upon  practical  issues  such  as  computational  efficiency 
and  storage  requirements.  In  this  section,  the  computational  and  storage  requirements,  with  the  system 
architecture,  /,  implemented  by  a  neural  network,  are  analyzed  and  compared.  We  choose  to  present  only  the 
dynamic  neural  network  case  for  two  reasons.  First,  the  dynamic  neural  network  system  is  the  most  common 
form  of  discrete-time  dynamic  system  encountered  in  the  neural  networks  field.  Second,  the  computation 
and  storage  requirements  are  easy  to  calculate  for  any  system  once  one  understands  these  requirements  for 
the  dynamic  neural  network.  Because  we  assume  a  neural  network  structure  of  the  system  architecture, 
the  computational  and  storage  requirements  of  the  backpropagation-through-time,  on-line  backpropagation- 
through-time,  epochwise  recursive  backpropagation  and  on-line  recursive  backpropagation  algorithms  are 
discussed  below. 

All  four  of  these  algorithms  are  based  on  the  backpropagation  algorithm.  Therefore,  an  understanding  of 
the  computation  requirements  of  this  algorithm  is  necessary  before  deriving  the  complexity  of  the  dynamic 
network  algorithms.  Epochwise  training,  using  the  backpropagation  algorithm,  consists  of  a  forward  propa¬ 
gation,  a  backward  propagation  and  a  weight  update.  Each  of  these  computations  requires  on  the  order  of 
Q  multiplications  and  additions,  where  Q  is  the  number  of  weights  in  the  network.  Therefore,  the  epochwise 
computational  requirement  of  the  backpropagation  algorithm  is  on  the  order  of  3Q  multiplications  and  addi¬ 
tions.  Throughout  the  remainder  of  this  section,  we  shall  use  the  term  operation  to  refer  to  a  multiplication 
and  addition. 

7.1  Backpropagation-Through-Time  Based  Algorithms 

The  backprcpagation-through-time  algorithm,  which  is  an  epochwise  technique  of  adapting  a  dynamic  sys¬ 
tem,  is  based  upon  repeated  forward  and  backward  propagations  through  the  dynamic  network.  For  any 
given  epoch,  kj  +  I  forward  and  backward  propagations  are  required.  Because  each  of  these  propagations 
requires  on  the  order  of  Q  multiplications  and  additions,  2{kj  -1-  1)Q  operations  are  needed  to  calculate  the 
error  gradients,  d*  E/dwk{i),  of  each  epoch.  Using  Equation  13  to  compute  the  epochwise  error  gradient  and 
Equation  5  to  update  the  weights  requires  approximately  (iy  +  1)Q  operations.  (In  calculating  the  epochwise 
computational  requirements,  it  is  assumed  that  the  weights  are  updated  at  each  epoch.)  The  total  number 
of  multiplications  and  additions  using  the  backpropagation-through-time  algorithm,  is 


Cbptt  ^  3(kj  +  I)Q.  (44) 

The  storage  requirement  of  the  backpropagation-through-time  algorithm  is  derived  from  two  primary 
components.  First,  the  weights  and  their  associated  error  gradients  need  to  be  stored  in  memory  for  efficient 
computation.  These  terms  require  2Q  floating  point  memory  slots.  Secondly,  the  output  vector,  y*  and 
external  input  vector,  r*  of  the  dynamic  system  at  each  iteration  of  the  epoch  are  required  for  the  calculation 
of  the  backward  sweep.  In  order  to  minimize  the  storage  requirements,  only  the  inputs  and  outputs  need 
be  stored.  The  internal  states  of  the  system,  such  as  the  hidden  node  activation  levels,  can  be  recalculated 
from  the  input  and  output  vectors.  It  should  be  noted,  that  minimization  of  the  storage  requirements  may 
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increase  the  computation  requirements  by  at  most  {kj  +  \)Q  because  of  the  need  to  recalculate  internal 
states.  The  external  input  is  composed  of  \f  floating  point  numbers  while  the  output  contains  .V  terms. 
Thus,  (kj  +  1)(A/  +  .\  )  memory  slots  are  required  for  the  external  inputs  and  outputs  of  the  dynamic  system. 
Adding  the  two  components  together,  the  minimal  storage  requirement  of  the  backpropagation-through-time 
algorithm  is 


Sbptt  ^  2Q  +  {k}  +  1)(A/  +  A'), 

The  computational  complexity  of  the  on-line  backpropagation-through-time  algorithm  is  based  upon  the 
number  of  operations  per  iteration.  These  requirements  are  easily  derived  from  those  of  the  backpropagation- 
through-time  algorithm.  At  each  iteration,  the  error  gradient  is  calculated  using  T  -f-  1  backpropagations, 
and  the  weights  are  updated  based  upon  this  gradient.  The  backpropagations  and  weight  updates  require 
2(7’  -b  1)Q  operations.  In  addition  to  these  computations,  one  forward  propagation  of  the  dynamic  system, 
which  requires  Q  multiplications  and  additions,  is  necessary.  Using  these  calculations,  the  computation 
requirements  per  iteration  of  the  on-line  algorithm  is 


CoBPTT  ^  (27’  -I-  3)Q. 

In  order  to  achieve  the  minimal  storage  requirements  of  the  on-line  backpropagation-through-time  al¬ 
gorithm,  the  output  vector,  yj,,  must  be  defined  as  shown  in  Equation  27.  In  this  case,  only  one  weight 
vector  need  be  stored  in  memory.  In  addition  to  the  weights,  the  associated  error  gradients  of  each  weight 
must  be  stored.  FinaJly,  only  the  output  vector  and  external  input  vector  of  the  previous  T  +  I  itera¬ 
tion  are  needed  for  calculation  of  the  error  gradient.  Thus,  the  minimal  storage  requirement  of  the  on-line 
backpropagation-through-time  algorithm  is 


SoBPTT  ^  2Q  +  (T  +  1)(A/  +  N). 

7.2  Recursive  Backpropagation  Algorithms 

The  epochwise  recursive  backpropagation  algorithm  is  baised  upon  a  forward  propagation,  a  recursive  update 
of  the  output  gradient  and  a  recursive  update  of  the  error  gradient  being  performed  at  each  iteration.  The 
weights  are  updated  at  the  final  iteration  based  upon  the  error  gradient.  The  computational  complexity  of  the 
forwaird  propagations  is  {kj  +  1)Q.  The  complexity  of  calculating  the  kj  +  I  output  gradients  is  determined 
by  the  computational  requirements  of  the  recursive  gradient  update  calculation,  Equation  32.  In  the  dynamic 
network  case,  the  two  terms  dyt/dwkii)  and  dyt/dYk  are  computed  using  N  backpropagations.  In  addition 
to  the  calculation  of  these  two  terms,  a  matrix- vector  multiplication,  which  requires  N^L  operations,  must  be 
performed  to  compute  the  second  term  of  Equation  32.  Finally,  the  addition  of  the  two  terms  of  Equation  32 
requires  N  operations  per  weight.  Adding  all  these  components  together,  the  computation  complexity  of  the 
recursive  output  gradient  calculation.  Equation  32,  is  (N^L  -)-  2N)Q.  The  complexity  of  finding  all  output 
gradients  is  (fcy  -b  2N)Q.  At  each  iteration,  the  error  gradient  is  updated  using  Equation  34  which 

requires  a  minimum  of  N  operations  per  weight.  The  computation  requirements  of  updating  the  Q  error 
gradients  over  the  fc/  -b  1  iterations  of  the  epoch  is  {kf  -b  l)NQ.  Finally,  Q  multiplications  and  additions  are 
needed  to  update  the  weights.  The  computational  requirements  of  the  epochwise  recursive  backpropagation 
algorithm  is 


CRB=={kf +  1){N^L  +  3N +l)Q  +  Q.  (45) 

As  pointed  out  in  Section  5.3.2,  the  storage  requirements  are  determined  by  the  recursive  output  gradient 
calculation.  Equation  32.  Therefore,  the  storage  requirement  of  the  epochwise  recursive  backpropagation 
algorithm  is 
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Epochwise  Algorithms 

On-Line  Algorithms 

Requirements 

Backpropagation- 
through  time 

Recursive 

Backpropagation 

Backpropagation- 

through-time 

Recursive 

Backpropagation 

Computational 

3(lq+  1)Q 

(kj+  1)(N^L  +  3N+  1)Q+Q 

(2T  +  3)Q 

(N^L  +  N(L  +  3)  +  2)Q 

Storage 

2Q  +  (kf+  1)(M+N) 

(L  +  1)  NQ 

2Q  +  (T  +  1)(M+N) 

(L+  1)NQ 

Table  1:  Computation  and  Storage  Requirements. 


Srb^(L+1)NQ. 

The  complexity  of  the  on-line  recursive  backpropagation  algorithm  follows  almost  immediately  from  the 
computational  requirements  of  the  epochwise  algorithm.  At  each  iteration,  the  error  gradient  is  calculated 
using  an  exponentially  weighted  recursive  update  calculation,  Equation  39.  The  only  difference  between  this 
equation  and  the  one  used  in  the  epochwise  case  is  the  exponential  weighting.  Because  the  weighting  constant, 
F  e  ,  is  a  diagonal  matrix,  LN  operations  per  weight  are  introduced  by  this  matrix  multiplication. 

Therefore,  the  recursive  update  calculation  of  Equation  39  requires  {N^L  -b  N{L  -1-  2))Q  operations.  The 
error  gradient.  Equation  37,  is  calculated  using  at  least  NQ  operations.  Finally,  the  forward  propagation  of 
the  system  and  the  weight  update  both  require  Q  operations.  The  computational  complexity  per  iteration 
of  the  on-line  recursive  backpropagation  algorithm  is 


CoRB  «  {N'^L  -b  N{L  -b  3)  -b  2)Q. 

The  only  differences  between  the  on-line  and  epochwise  recursive  algorithms  are  the  exponential  weighting 
of  the  recursive  update  equation  and  the  weight  update  at  each  iteration  of  the  on-line  case.  These  two 
differences  do  not  account  for  any  difference  in  storage  requirements.  Therefore,  the  storage  requirement  of 
the  on-line  recursive  backpropagation  is 


Sorb  «  (fi  -b  l)NQ. 

which  is  the  same  as  that  required  for  the  epochwise  recursive  backpropagation  algorithm. 

7.3  Comparison  of  Algorithms 

The  computational  and  storage  requirements  are  outlined  in  Table  1.  Comparing  the  computational  com¬ 
plexity  of  the  two  epochwise  algorithms,  equations  44  and  45,  we  find  the  epochwise  backpropagation- 
through-time  algorithm  to  be  computationally  more  efficient  than  the  epochwise  recursive  backpropagation 
algorithm.  In  general,  epochwise  algorithms  based  on  the  Euler-Lagrange  equations  are  more  efficient  than 
those  based  on  the  recursive  update  equation.  The  computational  inefficiency  of  the  recursive  technique  is 
a  result  of  the  output  gradient  calculation  which  requires  a  matrix-vector  multiplication  for  each  weight. 
Although  this  calculation  introduces  inefficiency  into  the  recursive  backpropagation  algorithm,  it  has  the 
advantage  of  fixing  the  storage  requirements  to  (L  +  \)NQ,  which  is  independent  of  the  number  of  iterations 
in  an  epoch.  In  some  cases,  where  the  total  number  of  iterations,  kj  -b  1,  is  large  compared  to  the  number  of 
weights,  Q,  the  epochwise  recursive  backpropagation  algorithm  may  be  advantageous  to  use  for  this  reason. 
However,  in  many  cases  the  total  number  of  iterations  is  small  compared  to  the  number  of  weights  and  the 
storage  requirements  favor  use  of  the  epochwise  backpropagation-through-time  algorithm. 
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The  ratio  of  the  computational  requirements  of  the  two  on-line  algorithms 


Cqbptt  ^ _ (‘2T  +  3)Q _ 

CoRB  ^  + \{L  +  3)  +  2)Q  ^ 

can  be  used  to  compare  the  efficiencies  of  the  two  on-line  algorithms.  The  most  efficient  on-line  inetliod 
can  be  cliosen  on  the  basis  of  the  number  of  outputs,  ,V,  the  maximum  delay  in  the  feedback  loop.  L,  and 
the  error  gradient  window  length  of  the  backpropagation-through-time  algorithm,  T.  Using  the  ratio  of 
Equation  46,  a  couple  of  general  statements  can  be  made  about  selection  of  an  on-line  algorithm  based  on 
computational  efficiency.  First,  for  systems  with  a  small  number  of  outputs,  N,  and  a  maximum  delay,  L. 
less  than  the  window  length,  T,  the  on-line  recursive  algorithm  is  the  most  efficient  technique  of  updating 
the  weights.  A  general  class  of  systems  which  meet  these  conditions  are  single  output  HR  adaptive  filters. 
With  this  in  mind,  it  is  not  surprising  that  in  the  adaptive  filler  field,  the  computationally  more  elficient 
on-line  recursive  technique  has  been  well  studied  [9,14]  while  we  are  unaware  of  any  attempts  to  use  the 
on-line  backpropagation-through-time  algorithm.  Secondly,  for  dynamic  networks  with  a  large  number  of 
outputs,  N ,  or  a  large  number  of  delays,  L.  the  backpropagation-through-time  algorithm  is  most  elficient. 
Fully  recurrent  networks,  which  have  a  large  number  of  outputs  because  each  node  is  regarded  as  an  output, 
should  be  adapted  using  the  on-line  backpropagation-through-time  algorithm.  The  algorithm  should  also  be 
used  to  train  multidimension  adaptive  filters. 

In  addition  to  the  computational  efficiency,  the  storage  requirements  of  the  two  on-line  algorithms  were 
derived  in  the  previously  in  this  section.  On  the  basis  of  the  storage  requirements  alone,  the  on-line 
backpropagation-through-time  algorithm  is  preferable  to  the  on-line  recursive  backpropagation  dgorithm 
when  (T  -f  \){N  -I-  M)  <  (L  -f  l)ArQ.  For  almost  all  dynamic  neural  network  systems,  this  inequality  will 
hold,  and  the  storage  requirement  will  favor  the  on-line  backpropagation-through-time  algorithm. 


8  Reducing  On-Line  Computational  Complexity 

Both  the  on-line  backpropagation-through-time  and  on-line  recursive  backpropagation  algorithm  are  com- 
putationdly  expensive.  In  this  section,  two  techniques  for  reducing  the  number  of  computations  are  briefly 
presented. 

8.1  Feeding  Back  the  Desired  Response 

We  have  already  stated  that  in  the  on-line  case,  it  is  conunon  to  minimize  the  square  error  at  each  iteration. 
In  this  case,  a  desired  response  vector,  dfc,  must  be  available  at  eeich  iteration.  One  method  of  speeding-up 
on-line  learning,  is  to  feed  back  the  desired  response  instead  of  the  output  vector.  Using  this  technique,  the 
system  equation  is 


yt  =  /(Rt,Dt.Wt(A:))  (47) 

where  D*  =  [dt_i.dL2>  •  •  •  <^k-L]  ^  Because  the  system  defined  by  Equation  47  is  independent 

of  previous  states  of  the  system  and  therefore  static,  the  error  gradient  cem  be  calculated  using  a  single 
backpropagation.  Obviously,  this  technique  is  computationally  less  expensive  than  the  two  on-line  algorithms 
of  Section  5. 

However,  a  price  is  to  be  paid  for  using  this  method.  An  approximate  error  gradient  is  calculated,  whose 
validity  depends  upon  the  magnitude  of  the  difference  between  the  output  and  desire  response  vectors  of  the 
previous  iterations.  If  these  vectors  are  significantly  different,  a  poor  approximation  of  the  error  gradient 
is  used  to  update  the  weights.  Thus,  even  though  the  calculations  per  iteration  are  reduced,  the  nutnber 
of  iteration  required  to  reach  convergence  will  probably  increase.  Despite  the  increase  in  the  number  of 
iterations,  feeding  back  the  desired  response  is  a  method  which  can  greatly  decrease  the  computationally 
complexity  per  iteration. 
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Feeding  back  the  desired  response  has  been  extensively  studied  in  the  field  of  adaptive  signal  processing  [9]. 
Tliis  technique,  which  heis  been  used  to  adapt  single  output  linear  filters,  is  known  as  the  output-error 
formulation.  A  detailed  analysis  of  the  advantages  and  disadvantages  of  using  this  method  for  an  adaptive 
linear  filter  can  be  found  in  Shynk,  1989  [14]. 

8.2  Redefining  the  On-Line  Error  Function 

A  second  technique,  which  reduces  the  computational  requirements  while  slightly  increasing  the  time  to 
convergence  when  using  the  on-line  backpropagation-through-time  algorithm,  is  based  upon  redefining  the  on¬ 
line  error  function.  In  the  on-line  case,  the  error  function  is  commonly  the  squared  error.  In  order  to  calculate 
the  error  gradient  using  the  on-line  backpropagation-through-time  algorithm,  T  backpropagations  through 
the  system  are  required  at  each  iteration.  However,  by  changing  the  error  function,  the  computational 
complexity  of  the  algorithm  can  be  significantly  reduced.  Instead  of  using  the  squared  error,  the  error 
function  can  be  redefined  as 


^  if  (*'  mod  C)  =  0 

dw(k')  y  0  otherwise 

where  C  is  an  integer  constant  greater  than  1,  and  k'  denotes  the  forward  iteration  count.  Using  this 
definition,  the  error  is  nonzero  every  C  iterations.  Thus,  the  error  gradient  need  only  be  calculated  every  C 
iterations  instead  of  every  iteration.  For  example,  if  C  =  10,  the  number  of  computations  is  approximately 
reduced  by  a  factor  of  10,  assuming  the  window  length  T  is  not  drastically  increased.  In  general,  the  window 
length  should  be  increased  to  T  -F  C,  and  the  learning  rate,  p,  should  be  multiplied  C.  The  reduction  in 
computational  requirements  is  accomplished  by  grouping  the  square  error  gradient  calculations.  Even  though 
the  number  of  computations  is  reduced,  the  number  of  iterations  to  convergence  may  increase  because  it 
may  not  be  possible  to  multiply  p  by  a  factor  of  C  for  stability  reasons.  A  more  detailed  explanation  of  this 
technique  of  reducing  the  on-line  computations  can  be  found  in  Williams  and  Peng,  1989  [15]. 

This  method  of  speeding-up  on-line  learning  can  only  be  used  for  the  on-line  Euler- Lagrange  based  algo¬ 
rithm.  Because  the  on-line  recursive  algorithm  calculates  the  output  gradients  at  eeu:h  iteration,  redefining 
the  error  gradient  as  shown  in  Equation  48  does  not  significantly  change  the  computational  complexity  of 
the  algorithm. 


9  Examples 

In  this  section,  two  examples  which  illustrate  the  uses  of  the  dynamic  system  training  algorithms  are  pre¬ 
sented.  The  first  example  demonstrates  the  use  of  the  algorithms  for  nonlinear  controller  design.  A  neural 
network  is  trained  using  the  Euler-Lagrange  based  algorithm  to  provide  the  steering  angle  of  a  boat  which 
is  placed  in  a  river  with  a  nonlinear  current.  By  providing  the  proper  steering  angle,  the  neural  network 
guides  the  boat  across  the  river  to  a  designated  dock  position.  The  second  example  illustrates  the  use  of 
the  on-line  recursive  algorithm  for  adaptive  filtering.  In  this  example,  an  adaptive  noise  cancelling  system 
is  trained  to  eliminate  filtered  noise  from  a  corrupted  signal. 

9.1  Nonlinear  Control  Example 

In  this  example,  a  boat  is  initially  placed  in  a  river,  which  is  200  feet  wide,  within  a  region  100  feet  upstream 
or  downstream  of  a  dock.  The  boat  is  powered  by  a  constant  thrust  motor  which  is  also  used  to  point  the 
boat  in  any  desired  direction.  Starting  from  the  initial  position,  it  is  desired  to  maneuver  the  boat  to  a 
dock,  which  is  located  on  one  shore  of  the  river.  Maneuvering  the  boat  to  the  dock  is  made  difficult  by  the 
stream’s  nonlinear  current. 
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Let  i*  denote  the  distance  from  the  center  of  the  boat  to  the  shore  with  the  dock  at  iteration  k.  Let  i/i 
denote  the  distance  of  the  center  of  the  boat  upstream  or  downstream  of  the  dock.  Assuming  the  current 
only  to  be  a  function  of  the  distance  from  the  shore,  xt,  the  equations  of  motion  for  the  boat  are 


=  Ik  +  lOcos(uk)  (48) 

I/t+i  =  yfc  +  lOsm(ut)  +  fc(xk). 

where  u*,  the  orientation  of  the  boat  given  in  radians,  is  the  control  signal,  and  fc(rk),  the  influence  of  the 
current  on  the  boat,  is  given  in  feet  per  iteration.  The  current,  which  is  parabolic  in  nature  with  the  greatest 
force  in  the  middle  of  the  stream  at  i  =  100,  is  given  by  the  following  equation 

/,(.*)  =  7.5  (ii  -  . 

The  control  signal  is  supplied  by  the  output  of  a  three  layer  neural  network.  The  first  layer  contains  the 
two  inputs,  x*  and  y*,  which  are  the  states  of  the  system.  The  hidden  layer  contains  ten  sigmoidal  neurons 
which  2ure  fully  connected  to  the  inputs  and  a  bias.  The  output  layer,  which  is  linear,  is  fully  connected  to 
the  hidden  layer  and  the  bias. 

The  boat  system  operates  in  an  epochwise  manner  with  the  initial  position  determined  randomly  and 
the  final  position  specified  as  the  iteration  prior  to  the  boat  hitting  the  dock’s  shore.  For  this  reason,  one  of 
the  two  epochwise  algorithms  should  be  used  to  train  the  neural  controller.  Because  of  the  computational 
efficiency  of  the  epochwise  Euler-Lagrange  based  algorithm,  it  was  selected  for  training  the  controller.  In 
order  to  make  the  boat  come  near  to  the  dock  at  the  final  iteration,  the  following  error  function  was  used 

E  =  ixi-Zk,f  ■\-{y<i-yk,f 

where  xj  is  the  x  position  of  the  dock  and  yj  is  the  y  position  of  the  dock. 

In  order  to  train  the  neural  controller,  4000  thousand  training  epochs  were  required  with  a  learning  rate, 
H  =  0.0001.  After  training,  four  demonstration  epochs,  which  are  shown  in  Figure  8,  were  run.  In  the 
lower  portion  of  Figure  8,  the  current  is  shown  as  a  function  of  x.  In  order  to  show  the  boat  graphically, 
it  was  necessary  to  move  the  two  shores  outward  a  distance  equal  to  half  the  boat  length.  For  this  reason, 
the  current  near  both  shores  is  shown  as  zero.  The  four  demonstration  epochs  show  that  by  using  the 
Euler-Lagrange  based  algorithm,  it  is  possible  to  design  a  neural  controller  for  the  boat  system. 

9.2  Adaptive  Filtering  Example 

In  this  example,  an  adaptive  noise  cancelling  system  was  used  to  reduce  additive  noise  from  a  corrupted 
signtd.  Before  getting  into  the  details  of  this  example,  the  adaptive  noise  cancelling  concept  is  introduced. 
Whenever  an  adaptive  noise  cancelling  system  is  to  be  used,  it  is  assumed  that  it  possible  to  detect  a  noise 
source,  ri,,  which  corrupts  the  original  signal,  s^.  Furthermore,  it  is  assumed  that  a  filter  version  of  the 
noise,  nt,  corrupts  the  original  signal.  Finally,  it  is  assumed  that  the  noise  signal  and  the  original  signal  are 
uncorrelated.  The  adaptive  noise  cancelling  system  receives  as  input  the  noise  signal,  rt,  and  the  corrupted 
signal,  Sk  +  Tiis-  In  order  to  eliminate  the  filtered  noise  from  the  corrupted  signal,  the  noise  signal  is  adaptively 
filtered  and  the  result,  yt,  is  subtracted  from  the  corrupted  signal.  If  the  adaptive  filter  is  appropriately 
trained  so  that  yt  =  rik,  this  subtraction  will  result  in  the  output  of  the  noise  cancelling  system,  (t,  being 
equal  to  the  original  signal.  Figure  9  shows  an  illustration  of  the  basic  noise  cancelling  system. 

We  have  stated  earlier  that  an  on-line  error  function  of  the  form  (d*  —  y*)*  minimizes  E[(di  -  y*)"]  For 
the  adaptive  noise  cancelling  system,  we  select  an  error  function  of  the  form  fj.  Therefore,  on-line  adaptation 
of  the  system  results  in  the  minimization  of  E[e^].  We  can  find  this  quantity  by  expanding  the  expected 
values  of  ct  as  follows 
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Figure  8:  Nonlinear  Control  Example. 


E[€j]  =  E[(#*  +  nk-y»)']  (49) 

=  E(aJ+2«*(nk-yt)*  +  (nt-yt)*].  (50) 

Assuming  the  original  signal  is  uncorrelated  with  the  noise  signal  and  the  adaptive  filter  output,  Equation  50 
can  be  written  as 

E(€j]  =  ElsJ]  +  E[(n»  -  y*)2].  (51) 

Minimization  of  Equation  51  requires  that  n*  =  y*.  Therefore,  by  using  an  on-line  error  function  of  the 
form  Ek  =  Ct'  noise  is  adaptively  eliminated  from  the  corrupted  signal  by  the  adaptive  noise  cancelling 
system.  For  a  more  detail  discussion  of  the  adaptive  noise  cancelling  concept,  see  Widrow  and  Stearns  [9]. 

In  our  example,  the  original  signal  was 


Sk  ~  .25cos(.4fc).  (52) 

The  noise  signal,  r*,  is  selected  randomly  from  a  uniform  distribution  between  -1.0  and  1.0.  The  filtered 
noise,  nib.  is  calculated  using  the  following  nonlinear  difference  equation 
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System 

Output 


Figure  9:  Adaptive  Noise  Cancelling  System. 


+  /„(nt_i) 

where 

/„(»..,)  =  (-("*-■-  10)^^  +  .5„p  + 1.0)^) 

It  should  be  noted  that  the  noise  filter  contained  nonlinear  feedback. 

The  adaptive  filter  was  implemented  by  a  three  layer  feedforward  neural  network.  The  input  layer  was 
composed  of  two  components,  the  noise  signal,  r*,  and  the  previous  output  of  the  adaptive  filter,  yk-i-  The 
hidden  layer  was  composed  of  17  hidden  units  each  of  which  were  squashed  by  the  sigmoidal  function.  The 
first  five  nodes  were  connected  through  five  different  weights  to  the  noise  signal.  The  remaining  ten  nodes 
were  connected  to  the  feedback  signal,  yk-i-  In  addition,  each  of  the  hidden  units  were  connect  to  a  bias. 
The  output  layer  contained  one  linear  unit  which  was  connected  to  the  hidden  nodes  and  the  bias  through 
18  separate  weights. 

One  of  the  primary  reasons  for  selecting  the  adaptive  noise  cancelling  system  as  an  example  is  that  the 
feedback  adaptive  filter  described  above  can  only  be  trained  using  one  of  the  on-line  learning  algorithms 
discussed  in  Section  5.  The  speed-up  technique  of  feeding  back  the  desired  response  cannot  be  used  for 
this  example  because  a  desired  response  does  not  exist.  The  on-line  recursive  gradient  update  2tlgorithm 
was  selected  for  training  the  adaptive  filter  because  it  is  computationally  more  efficient  than  the  on-line 
Euler-Lagrange  based  algorithm  when  the  number  of  outputs,  N,  and  the  number  of  delays,  L,  are  both 
equal  to  1. 

A  learning  curve  for  the  system,  with  the  learn  rate,  fi  =  .005,  and  the  forgetting  factor,  a  =  .95,  is 
shown  in  Figure  10.  The  initial  decrease  in  the  mean  squared  error  over  the  first  couple  hundred  iterations  is 
due  to  learning  the  feedforward  component  of  the  filter.  The  slow  learning,  which  lasts  for  several  thousand 
iterations,  is  due  to  learning  the  feedback  component.  The  corrupted  signal,  s^  +  n^,  and  the  original 
signal,  Sk,  for  iterations  5900-6000  are  shown  in  Figure  11.  Notice  that  it  is  impossible  to  determine  the 
characteristics  of  the  original  signal  from  the  corrupted  signal.  The  output  signal,  (t,  and  original  signal, 
st,  for  these  same  iterations  are  shown  in  Figure  12.  Although  the  output  signal  is  not  perfect,  the  noise 
has  been  significantly  reduced. 


(53) 

(54) 
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10  Conclusion 

The  training  of  discrete-time  dynamic  systems  using  first-order  gradient  descent  can  be  accomplished  using 
either  the  Euler-Lagrange  based  algorithm  or  the  recursive  gradient  update  algorithm.  Both  these  algorithms 
have  been  derived  in  this  paper  using  the  notation  of  the  standard  representation.  Epochwise  training  can 
be  accomplished  using  either  of  the  two  epochwise  training  algorithms  which  have  been  shown  to  produce 
identical  weight  updates.  In  general,  because  of  both  computational  and  storage  requirements,  the  Euler- 
Lagrange  based  algorithm  is  preferable  for  epochwise  training.  However,  the  epochwise  recursive  algorithm 
may  be  desirable  in  cases  where  constant  memory  size  is  required.  The  two  on-line  algorithms  produce 
approximately  the  same  weight  updates  at  each  iteration.  In  general,  the  selection  of  an  on-line  algorithm 
is  determined  by  the  number  of  outputs,  N,  of  the  dynamic  system.  As  this  number  increases,  it  becomes 
increasingly  computationally  expensive  to  use  the  recursive  algorithm.  Therefore,  for  large  N,  the  Euler- 
Lagrange  based  algorithm  is  preferable  for  on-line  training.  Both  on-line  algorithms  are  computationally 
expensive.  One  method  of  reducing  the  computations  is  to  feedback  the  desired  responses,  if  they  are 
available.  Another  method,  which  is  applicable  only  to  the  Euler-Lagrange  based  technique,  is  to  redefine 
the  error  function.  Finally,  two  examples  which  illustrate  the  usefulness  of  the  algorithms  are  presented. 
The  first  demonstrates  the  use  of  the  Euler-Lagrange  based  algorithm  for  designing  nonlinear  state  feedback 
controllers.  The  second  illustrates  the  necessity  of  on-line  algorithms  in  certain  aidaptive  filtering  problems. 
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